Efficient Substructure Discovery from Large Semi-structured Data

Authors

  • Tatsuya Asai
  • Kenji Abe
  • Shinji Kawasoe
  • Hiroki Arimura
  • Hiroshi Sakamoto
  • Setsuo Arikawa
Abstract

With the rapid progress of network and storage technologies, a huge amount of electronic data such as Web pages and XML data [23] has become available on intranets and the Internet. These electronic data are heterogeneous collections of ill-structured data that have no rigid structure, and are often called semi-structured data [1]. Hence, there has been increasing demand for automatic methods for extracting useful information, particularly for discovering rules or patterns from large collections of semi-structured data, namely semi-structured data mining [6, 11, 18, 19, 21, 25]. In this paper, we model such semi-structured data and patterns by labeled ordered trees, and study the problem of discovering all frequent tree-like patterns that have a support of at least minsup in a given collection of semi-structured data. We present an efficient pattern mining algorithm, FREQT, for discovering all frequent tree patterns from a large collection of labeled ordered trees. Previous algorithms for finding tree-like patterns basically adopted a straightforward generate-and-test strategy [19, 24]. In contrast, our algorithm FREQT is an incremental algorithm that simultaneously constructs the set of frequent patterns and their occurrences level by level. For this purpose, we devise an efficient enumeration technique for ordered trees by generalizing the itemset enumeration tree by Bayardo [10]. The key to our method is the notion of rightmost expansion, a technique to grow a tree by attaching new nodes only on the rightmost branch of the tree. Furthermore, we show that it is sufficient to maintain only the occurrences of the ...
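
The rightmost expansion described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' FREQT implementation: it only enumerates candidate patterns and omits occurrence lists and the minsup test. The encoding is an assumption made here for illustration: a labeled ordered tree is a tuple of (depth, label) pairs in preorder, with the root at depth 0. The function names `rightmost_expansions` and `enumerate_patterns` are hypothetical.

```python
def rightmost_expansions(pattern, labels):
    """Return all patterns obtained by adding one node to `pattern`.

    `pattern` is a tuple of (depth, label) pairs in preorder (assumed encoding).
    The new node becomes the rightmost child of some node on the rightmost
    branch, so its depth ranges from 1 to (depth of the rightmost leaf) + 1.
    """
    if not pattern:
        # The empty pattern expands to every single-node tree.
        return [((0, label),) for label in labels]
    rightmost_leaf_depth = pattern[-1][0]
    expanded = []
    for depth in range(1, rightmost_leaf_depth + 2):
        for label in labels:
            # Appending in preorder places the node last, i.e. rightmost.
            expanded.append(pattern + ((depth, label),))
    return expanded


def enumerate_patterns(labels, max_size):
    """Enumerate every labeled ordered tree with at most `max_size` nodes,
    level by level (level k holds the patterns with k nodes)."""
    level = rightmost_expansions((), labels)
    result = list(level)
    for _ in range(max_size - 1):
        level = [q for p in level for q in rightmost_expansions(p, labels)]
        result.extend(level)
    return result


if __name__ == "__main__":
    for tree in enumerate_patterns(["a"], 3):
        print(tree)
```

Because each tree has exactly one preorder sequence and that sequence has exactly one prefix of each length, every pattern is generated once from a unique parent, so no duplicate check is needed. A frequent-pattern miner built on this sketch would additionally discard the patterns of each level whose support falls below minsup before expanding the next level.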

Similar resources

Discovering Frequent Substructures in Large Unordered Trees

In this paper, we study a frequent substructure discovery problem in semi-structured data. We present an efficient algorithm Unot that computes all frequent labeled unordered trees appearing in a large collection of data trees with frequency above a user-specified threshold. The keys of the algorithm are efficient enumeration of all unordered trees in canonical form and incremental computation ...

Discovering Frequent Substructures from Hierarchical Semi-structured Data

Frequent substructure discovery from a collection of semi-structured objects can serve for storage, browsing, querying, indexing and classification of semi-structured documents. This paper examines the problem of discovering frequent substructures from a collection of hierarchical semi-structured objects of the same type. The use of wildcard is an important aspect of substructure discovery from...

Efficient Algorithms for Discovering Frequent and Maximal Substructures from Large Semistructured Data

In this paper, we review recent advances in efficient algorithms for semi-structured data mining, that is, discovery of rules and patterns from structured data such as sets, sequences, trees, and graphs. After introducing basic definitions and problems, we present efficient algorithms for frequent and maximal pattern mining for classes of sets, sequences, and trees. In particular, we explain ge...

Efficient Text and Semi-structured Data Mining: Knowledge Discovery in the Cyberspace

This paper describes applications of the optimized pattern discovery framework to text and Web mining. In particular, we introduce a class of simple combinatorial patterns over texts such as proximity phrase association patterns and ordered and unordered tree patterns modeling unstructured texts and semi-structured data on the Web. Then, we consider the problem of finding the patterns that opti...

Graph-Based Hierarchical Conceptual Clustering

Hierarchical conceptual clustering has proven to be a useful, although under-explored, data mining technique. A graph-based representation of structural information combined with a substructure discovery technique has been shown to be successful in knowledge discovery. The SUBDUE substructure discovery system provides one such combination of approaches. This work presents SUBDUE and the develop...

Journal:
  • IEICE Transactions

Volume 87-D

Publication date: 2002